AITopics

Country: Asia > China > Heilongjiang Province > Harbin (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.68)

Industry: Information Technology (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Soor, Sampriti, Ghosh, Suklav, Sur, Arijit

Universal Adversarial Suffixes Using Calibrated Gumbel-Softmax Relaxation

arXiv.org Artificial IntelligenceDec-10-2025

Language models (LMs) are often used as zero-shot or few-shot classifiers by scoring label words, but they remain fragile to adversarial prompts. Prior work typically optimizes task- or model-specific triggers, making results difficult to compare and limiting transferability. We study universal adversarial suffixes: short token sequences (4-10 tokens) that, when appended to any input, broadly reduce accuracy across tasks and models. Our approach learns the suffix in a differentiable "soft" form using Gumbel-Softmax relaxation and then discretizes it for inference. Training maximizes calibrated cross-entropy on the label region while masking gold tokens to prevent trivial leakage, with entropy regularization to avoid collapse. A single suffix trained on one model transfers effectively to others, consistently lowering both accuracy and calibrated confidence. Experiments on sentiment analysis, natural language inference, paraphrase detection, commonsense QA, and physical reasoning with Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B demonstrate consistent attack effectiveness and transfer across tasks and model families.

large language model, machine learning, natural language, (21 more...)

2512.08123

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.50)

WIREDNov-28-2025, 10:00:00 GMT

Poems Can Trick AI Into Helping You Make a Nuclear Weapon

It turns out all the guardrails in the world won't protect a chatbot from meter and rhyme. You can get ChatGPT to help you build a nuclear bomb if you simply design the prompt in the form of a poem, according to a new study from researchers in Europe. The study, Adversarial Poetry as a Universal Single-Turn Jailbreak in Large Language Models (LLMs)," comes from Icaro Lab, a collaboration of researchers at Sapienza University in Rome and the DexAI think tank. According to the research, AI chatbots will dish on topics like nuclear weapons, child sex abuse material, and malware so long as users phrase the question in the form of a poem. "Poetic framing achieved an average jailbreak success rate of 62 percent for hand-crafted poems and approximately 43 percent for meta-prompt conversions," the study said. The researchers tested the poetic method on 25 chatbots made by companies like OpenAI, Meta, and Anthropic . It worked, with varying degrees of success, on all of them. WIRED reached out to Meta, Anthropic, and OpenAI for a comment but didn't hear back. The researchers say they've reached out as well to share their results. AI tools like Claude and ChatGPT have guardrails that prevent them from answering questions about "revenge porn" and the creation of weapons-grade plutonium. But it's easy to confuse those guardrails by adding " adversarial suffixes " to a prompt. Basically, add a bunch of extra junk to a question and it confuses the AI and bypasses its safety systems. The poetry jailbreak is similar. "If adversarial suffixes are, in the model's eyes, a kind of involuntary poetry, then real human poetry might be a natural adversarial suffix," the team at Icaro Lab, the researchers behind the poetry jailbreak, tell WIRED. "We experimented by reformulating dangerous requests in poetic form, using metaphors, fragmented syntax, oblique references.

large language model, machine learning, natural language, (18 more...)

WIRED

Country:

North America > United States > California (0.05)
Europe > Slovakia (0.05)
Europe > Czechia (0.05)

Genre: Research Report > New Finding (0.70)

Industry:

Government > Military (1.00)
Information Technology (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.46)

Neural Information Processing SystemsNov-20-2025, 07:02:12 GMT

f545448535dfde4f9786555403ab7c49-Paper-Conference.pdf

arxiv preprint arxiv, large language model, machine learning, (19 more...)

Country:

North America > Mexico > Puebla (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
North America > United States > Maryland (0.04)
(3 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Industry:

Information Technology > Security & Privacy (1.00)
Government (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.93)
(2 more...)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Mura, Raffaele, Piras, Giorgio, Lukošiūtė, Kamilė, Pintor, Maura, Karbasi, Amin, Biggio, Battista

LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback

arXiv.org Artificial IntelligenceOct-31-2025

Jailbreaks are adversarial attacks designed to bypass the built-in safety mechanisms of large language models. Automated jailbreaks typically optimize an adversarial suffix or adapt long prompt templates by forcing the model to generate the initial part of a restricted or harmful response. In this work, we show that existing jailbreak attacks that leverage such mechanisms to unlock the model response can be detected by a straightforward perplexity-based filtering on the input prompt. To overcome this issue, we propose LatentBreak, a white-box jailbreak attack that generates natural adversarial prompts with low perplexity capable of evading such defenses. LatentBreak substitutes words in the input prompt with semantically-equivalent ones, preserving the initial intent of the prompt, instead of adding high-perplexity adversarial suffixes or long templates. These words are chosen by minimizing the distance in the latent space between the representation of the adversarial prompt and that of harmless requests. Our extensive evaluation shows that LatentBreak leads to shorter and low-perplexity prompts, thus outperforming competing jailbreak algorithms against perplexity-based filters on multiple safety-aligned models.

large language model, machine learning, natural language, (20 more...)

2510.08604

Country:

North America > United States (0.46)
Europe (0.28)
Asia > Middle East > UAE (0.28)

Genre: Research Report > New Finding (0.93)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Consumer Health (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.33)

Yang, Xiaoxue, Stevanoski, Bozhidar, Meeus, Matthieu, de Montjoye, Yves-Alexandre

Checkpoint-GCG: Auditing and Attacking Fine-Tuning-Based Prompt Injection Defenses

arXiv.org Artificial IntelligenceOct-17-2025

Large language models (LLMs) are increasingly deployed in real-world applications ranging from chatbots to agentic systems, where they are expected to process untrusted data and follow trusted instructions. Failure to distinguish between the two poses significant security risks, exploited by prompt injection attacks, which inject malicious instructions into the data to control model outputs. Model-level defenses have been proposed to mitigate prompt injection attacks. These defenses fine-tune LLMs to ignore injected instructions in untrusted data. We introduce Checkpoint-GCG, a white-box attack against fine-tuning-based defenses. Checkpoint-GCG enhances the Greedy Coordinate Gradient (GCG) attack by leveraging intermediate model checkpoints produced during fine-tuning to initialize GCG, with each checkpoint acting as a stepping stone for the next one to continuously improve attacks. First, we instantiate Checkpoint-GCG to evaluate the robustness of the state-of-the-art defenses in an auditing setup, assuming both (a) full knowledge of the model input and (b) access to intermediate model checkpoints. We show Checkpoint-GCG to achieve up to $96\%$ attack success rate (ASR) against the strongest defense. Second, we relax the first assumption by searching for a universal suffix that would work on unseen inputs, and obtain up to $89.9\%$ ASR against the strongest defense. Finally, we relax both assumptions by searching for a universal suffix that would transfer to similar black-box models and defenses, achieving an ASR of $63.9\%$ against a newly released defended model from Meta.

large language model, machine learning, natural language, (19 more...)

2505.15738

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)

Neural Information Processing SystemsOct-10-2025, 21:42:48 GMT

Refusal in Language Models Is Mediated by a Single Direction Andy Arditi Independent Oscar Obeso

While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size.

arxiv preprint arxiv, instruction, refusal, (14 more...)

Country:

North America > Mexico > Puebla (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
North America > United States > Maryland (0.04)
(3 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Industry:

Information Technology > Security & Privacy (1.00)
Government (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.93)
(2 more...)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing SystemsOct-10-2025, 13:14:14 GMT

aeae9df8e6bfe7350160bb42965dbd44-Paper-Conference.pdf

arxiv preprint arxiv, gradient, representation, (14 more...)

Country: Asia > China > Heilongjiang Province > Harbin (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.68)

Industry: Information Technology (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Yamaguchi, Kureha, Etheridge, Benjamin, Arditi, Andy

Adversarial Manipulation of Reasoning Models using Internal Representations

arXiv.org Artificial IntelligenceAug-29-2025

Reasoning models generate chain-of-thought (CoT) tokens before their final output, but how this affects their vulnerability to jailbreak attacks remains unclear. While traditional language models make refusal decisions at the prompt-response boundary, we find evidence that DeepSeek-R1-Distill-Llama-8B makes these decisions within its CoT generation. We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply -- termed the "caution" direction because it corresponds to cautious reasoning patterns in the generated text. Ablating this direction from model activations increases harmful compliance, effectively jailbreaking the model. We additionally show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates. Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models. Code available at https://github.com/ky295/reasoning-manipulation.

large language model, machine learning, natural language, (18 more...)

2507.03167

Genre: Research Report > New Finding (0.86)

Industry:

Law (1.00)
Information Technology > Security & Privacy (0.94)
Health & Medicine > Therapeutic Area > Immunology (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)

Biswas, Sajib, Nishino, Mao, Chacko, Samuel Jacob, Liu, Xiuwen

Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent

arXiv.org Artificial IntelligenceAug-21-2025

As large language models (LLMs) are increasingly deployed in critical applications, ensuring their robustness and safety alignment remains a major challenge. Despite the overall success of alignment techniques such as reinforcement learning from human feedback (RLHF) on typical prompts, LLMs remain vulnerable to jailbreak attacks enabled by crafted adversarial triggers appended to user prompts. Most existing jailbreak methods either rely on inefficient searches over discrete token spaces or direct optimization of continuous embeddings. While continuous embeddings can be given directly to selected open-source models as input, doing so is not feasible for proprietary models. On the other hand, projecting these embeddings back into valid discrete tokens introduces additional complexity and often reduces attack effectiveness. We propose an intrinsic optimization method which directly optimizes relaxed one-hot encodings of the adversarial suffix tokens using exponentiated gradient descent coupled with Bregman projection, ensuring that the optimized one-hot encoding of each token always remains within the probability simplex. We provide theoretical proof of convergence for our proposed method and implement an efficient algorithm that effectively jailbreaks several widely used LLMs. Our method achieves higher success rates and faster convergence compared to three state-of-the-art baselines, evaluated on five open-source LLMs and four adversarial behavior datasets curated for evaluating jailbreak methods. In addition to individual prompt attacks, we also generate universal adversarial suffixes effective across multiple prompts and demonstrate transferability of optimized suffixes to different LLMs.

arxiv preprint arxiv, large language model, machine learning, (17 more...)

2508.14853

Country: North America > United States (0.68)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology > Security & Privacy (1.00)
Law Enforcement & Public Safety (0.93)
Government (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)